1.
Elife ; 12, 2023 02 27.
Article in English | MEDLINE | ID: mdl-36847334

ABSTRACT

Predicting the function of a protein from its amino acid sequence is a long-standing challenge in bioinformatics. Traditional approaches use sequence alignment to compare a query sequence either to thousands of models of protein families or to large databases of individual protein sequences. Here we introduce ProteInfer, which instead employs deep convolutional neural networks to predict a variety of protein functions - Enzyme Commission (EC) numbers and Gene Ontology (GO) terms - directly from an unaligned amino acid sequence. This approach provides precise predictions which complement alignment-based methods, and the computational efficiency of a single neural network permits novel and lightweight software interfaces, which we demonstrate with an in-browser graphical interface for protein function prediction in which all computation is performed on the user's personal computer with no data uploaded to remote servers. Moreover, these models place full-length amino acid sequences into a generalised functional space, facilitating downstream analysis and interpretation. To read the interactive version of this paper, please visit https://google-research.github.io/proteinfer/.


Subject(s)
Algorithms , Neural Networks, Computer , Proteins/genetics , Proteins/chemistry , Amino Acid Sequence , Software , Computational Biology/methods
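
The ProteInfer abstract above describes convolutional networks that read an unaligned amino acid sequence directly. As a rough illustration of that input path only, the sketch below one-hot encodes a sequence and slides a single filter along it, then max-pools so that sequences of any length yield one score. The function names, the all-zero encoding of unknown residues, and the motif-matching filter are assumptions made for illustration; this is not the ProteInfer architecture or its code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a sequence as a list of 20-dimensional one-hot rows.

    Residues outside the 20 standard letters (e.g. 'X') become all-zero
    rows; this handling is an assumption of the sketch.
    """
    rows = []
    for aa in sequence.upper():
        row = [0.0] * len(AMINO_ACIDS)
        if aa in INDEX:
            row[INDEX[aa]] = 1.0
        rows.append(row)
    return rows


def motif_kernel(motif):
    """Build a hypothetical filter that scores 1 per residue matching `motif`."""
    return one_hot(motif)


def conv1d_max(encoded, kernel):
    """Slide one filter along the sequence and max-pool over positions.

    The result is a single number regardless of sequence length, which is
    why no alignment or padding to a fixed width is needed.
    """
    width = len(kernel)
    if len(encoded) < width:
        raise ValueError("sequence is shorter than the filter")
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            kernel[k][j] * encoded[start + k][j]
            for k in range(width)
            for j in range(len(AMINO_ACIDS))
        )
        scores.append(score)
    return max(scores)


if __name__ == "__main__":
    kernel = motif_kernel("GP")
    for seq in ("AAGPA", "GAAP", "AAAA"):
        print(seq, conv1d_max(one_hot(seq), kernel))
```

A trained network would learn many such filters and stack several layers; the hand-built filter here only shows how a convolution responds to a local pattern wherever it occurs in the sequence.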
2.
Nat Biotechnol ; 41(8): 1073-1074, 2023 Aug.
Article in English | MEDLINE | ID: mdl-36702894
3.
Methods Mol Biol ; 2586: 49-77, 2023.
Article in English | MEDLINE | ID: mdl-36705898

ABSTRACT

Here we detail the LandscapeFold secondary structure prediction algorithm and how it is used. The algorithm was previously described and tested in (Kimchi O et al., Biophys J 117(3):520-532, 2019), though it was not named there. The algorithm directly enumerates all possible secondary structures into which up to two RNA or single-stranded DNA sequences can fold. It uses a polymer physics model to estimate the configurational entropy of structures including complex pseudoknots. We detail each of these steps and ways in which the user can adjust the algorithm as desired. The code is available on the GitHub repository https://github.com/ofer-kimchi/LandscapeFold .


Subject(s)
Algorithms , RNA , Nucleic Acid Conformation , RNA/genetics , Entropy , DNA, Single-Stranded
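
The LandscapeFold abstract above refers to direct enumeration of all possible secondary structures, pseudoknots included. The sketch below shows what brute-force enumeration looks like at the level of individual base pairs for a single short RNA. It is a naive illustration under stated assumptions (Watson-Crick and G-U wobble pairs, a minimum hairpin loop of 3 unpaired bases, no energy or entropy model) and is not the LandscapeFold code from the repository cited in the abstract.

```python
# Allowed pairings: Watson-Crick plus G-U wobble (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def enumerate_structures(seq, min_loop=3):
    """Return every set of base pairs in which each base pairs at most once.

    Crossing (pseudoknotted) pairs are allowed. The number of structures
    grows exponentially with length, so this is only usable on very short
    sequences.
    """
    n = len(seq)
    results = []

    def extend(i, used, pairs):
        if i == n:
            results.append(tuple(pairs))
            return
        if i in used:  # already the 3' partner of an earlier base
            extend(i + 1, used, pairs)
            return
        extend(i + 1, used, pairs)  # option 1: leave base i unpaired
        for j in range(i + min_loop + 1, n):  # option 2: pair i with some j
            if j not in used and (seq[i], seq[j]) in PAIRS:
                used.add(j)
                pairs.append((i, j))
                extend(i + 1, used, pairs)
                pairs.pop()
                used.remove(j)

    extend(0, set(), [])
    return results


def is_pseudoknotted(pairs):
    """True if any two pairs cross, i.e. i < k < j < l for (i, j) and (k, l)."""
    return any(a < c < b < d for (a, b) in pairs for (c, d) in pairs)


if __name__ == "__main__":
    structures = enumerate_structures("GAAAACAAU")
    knotted = [s for s in structures if is_pseudoknotted(s)]
    print(len(structures), "structures,", len(knotted), "with crossing pairs")
```

Enumerating individual pairs like this makes the combinatorial growth easy to see; ranking the resulting structures would additionally require a free energy for each one, which the sketch leaves out.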
4.
Nucleic Acids Res ; 51(D1): D753-D759, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36477304

ABSTRACT

The MGnify platform (https://www.ebi.ac.uk/metagenomics) facilitates the assembly, analysis and archiving of microbiome-derived nucleic acid sequences. The platform provides access to taxonomic assignments and functional annotations for nearly half a million analyses covering metabarcoding, metatranscriptomic, and metagenomic datasets, which are derived from a wide range of different environments. Over the past 3 years, MGnify has not only grown in terms of the number of datasets contained but also increased the breadth of analyses provided, such as the analysis of long-read sequences. The MGnify protein database now exceeds 2.4 billion non-redundant sequences predicted from metagenomic assemblies. This collection is now organised into a relational database making it possible to understand the genomic context of the protein through navigation back to the source assembly and sample metadata, marking a major improvement. To extend beyond the functional annotations already provided in MGnify, we have applied deep learning-based annotation methods. The technology underlying MGnify's Application Programming Interface (API) and website has been upgraded, and we have enabled the ability to perform downstream analysis of the MGnify data through the introduction of a coupled Jupyter Lab environment.


Subject(s)
Microbiota , Sequence Analysis , Genomics/methods , Metagenome , Metagenomics/methods , Microbiota/genetics , Software , Sequence Analysis/methods
5.
Database (Oxford) ; 2022, 2022 08 12.
Article in English | MEDLINE | ID: mdl-35961013

ABSTRACT

Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.


Subject(s)
Genomics , Proteins , Base Sequence , Computational Biology , Genome , Molecular Sequence Annotation
6.
Biophys J ; 121(16): 3023-3033, 2022 08 16.
Article in English | MEDLINE | ID: mdl-35859421

ABSTRACT

Collagen fibrils are the major constituents of the extracellular matrix, which provides structural support to vertebrate connective tissues. It is widely assumed that the superstructure of collagen fibrils is encoded in the primary sequences of the molecular building blocks. However, the interplay between large-scale architecture and small-scale molecular interactions makes the ab initio prediction of collagen structure challenging. Here, we propose a model that allows us to predict the periodic structure of collagen fibers and the axial offset between the molecules, purely on the basis of simple predictive rules for the interaction between amino acid residues. With our model, we identify the sequence-dependent collagen fiber geometries with the lowest free energy and validate the predicted geometries against the available experimental data. We propose a procedure for searching for optimal staggering distances. Finally, we build a classification algorithm and use it to scan 11 data sets of vertebrate fibrillar collagens, and predict the periodicity of the resulting assemblies. We analyze the experimentally observed variance of the optimal stagger distances across species, and find that these distances, and the resulting fibrillar phenotypes, are evolutionarily well preserved. Moreover, we observe that the energy minimum at the optimal stagger distance is broad in all cases, suggesting a further evolutionary adaptation designed to improve the assembly kinetics. Our periodicity predictions are in good agreement with the experimental data on molecular staggering, not only for all collagen types analyzed but also for synthetic peptides. We argue that, with our model, it becomes possible to design tailor-made, periodic collagen structures, thereby enabling the design of novel biomimetic materials based on collagen-mimetic trimers.


Subject(s)
Biomimetic Materials , Collagen , Biomimetic Materials/chemistry , Collagen/metabolism , Extracellular Matrix/metabolism , Fibrillar Collagens , Peptides/chemistry
7.
Nat Biotechnol ; 40(6): 932-937, 2022 06.
Article in English | MEDLINE | ID: mdl-35190689

ABSTRACT

Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.


Subject(s)
Deep Learning , Amino Acid Sequence , Databases, Protein , Humans , Molecular Sequence Annotation , Proteome/metabolism , Proteomics
8.
Cell Syst ; 12(11): 1019-1020, 2021 11 17.
Article in English | MEDLINE | ID: mdl-34793698

ABSTRACT

Machine-learning-guided protein design is rapidly emerging as a strategy to find high-fitness multi-mutant variants. In this issue of Cell Systems, Wittman et al. analyze the impact of design decisions for machine-learning-assisted directed evolution (MLDE) on its ability to navigate a fitness landscape and reliably find global optima.


Subject(s)
Machine Learning , Proteins
9.
Nat Biotechnol ; 39(6): 691-696, 2021 06.
Article in English | MEDLINE | ID: mdl-33574611

ABSTRACT

Modern experimental technologies can assay large numbers of biological sequences, but engineered protein libraries rarely exceed the sequence diversity of natural protein families. Machine learning (ML) models trained directly on experimental data without biophysical modeling provide one route to accessing the full potential diversity of engineered proteins. Here we apply deep learning to design highly diverse adeno-associated virus 2 (AAV2) capsid protein variants that remain viable for packaging of a DNA payload. Focusing on a 28-amino acid segment, we generated 201,426 variants of the AAV2 wild-type (WT) sequence yielding 110,689 viable engineered capsids, 57,348 of which surpass the average diversity of natural AAV serotype sequences, with 12-29 mutations across this region. Even when trained on limited data, deep neural network models accurately predict capsid viability across diverse variants. This approach unlocks vast areas of functional but previously unreachable sequence space, with many potential applications for the generation of improved viral vectors and protein therapeutics.


Subject(s)
Capsid Proteins/genetics , Dependovirus/genetics , Machine Learning , Genetic Vectors , HeLa Cells , Humans
10.
Nat Biotechnol ; 38(8): 989-999, 2020 08.
Article in English | MEDLINE | ID: mdl-32284585

ABSTRACT

A central challenge in expanding the genetic code of cells to incorporate noncanonical amino acids into proteins is the scalable discovery of aminoacyl-tRNA synthetase (aaRS)-tRNA pairs that are orthogonal in their aminoacylation specificity. Here we computationally identify candidate orthogonal tRNAs from millions of sequences and develop a rapid, scalable approach, named tRNA Extension (tREX), to determine the in vivo aminoacylation status of tRNAs. Using tREX, we test 243 candidate tRNAs in Escherichia coli and identify 71 orthogonal tRNAs, covering 16 isoacceptor classes, and 23 functional orthogonal tRNA-cognate aaRS pairs. We discover five orthogonal pairs, including three highly active amber suppressors, and evolve new amino acid substrate specificities for two pairs. Finally, we use tREX to characterize a matrix of 64 orthogonal synthetase-orthogonal tRNA specificities. This work expands the number of orthogonal pairs available for genetic code expansion and provides a pipeline for the discovery of additional orthogonal pairs and a foundation for encoding the cellular synthesis of noncanonical biopolymers.


Subject(s)
Amino Acyl-tRNA Synthetases/metabolism , RNA, Transfer/metabolism , Amino Acid Sequence , Amino Acyl-tRNA Synthetases/genetics , Computer Simulation , Escherichia coli , Gene Expression Regulation, Bacterial , Green Fluorescent Proteins , Protein Binding , Substrate Specificity
11.
Sci Rep ; 10(1): 3397, 2020 02 25.
Article in English | MEDLINE | ID: mdl-32099005

ABSTRACT

Collagen fibrils are central to the molecular organization of the extracellular matrix (ECM) and to defining the cellular microenvironment. Glycation of collagen fibrils is known to impact on cell adhesion and migration in the context of cancer and in model studies, glycation of collagen molecules has been shown to affect the binding of other ECM components to collagen. Here we use TEM to show that ribose-5-phosphate (R5P) glycation of collagen fibrils - potentially important in the microenvironment of actively dividing cells, such as cancer cells - disrupts the longitudinal ordering of the molecules in collagen fibrils and, using KFM and FLiM, that R5P-glycated collagen fibrils have a more negative surface charge than unglycated fibrils. Altered molecular arrangement can be expected to impact on the accessibility of cell adhesion sites and altered fibril surface charge on the integrity of the extracellular matrix structure surrounding glycated collagen fibrils. Both effects are highly relevant for cell adhesion and migration within the tumour microenvironment.


Subject(s)
Collagen Type I/chemistry , Extracellular Matrix/chemistry , Ribosemonophosphates/chemistry , Animals , Collagen Type I/metabolism , Extracellular Matrix/metabolism , Glycosylation , Humans , Ribosemonophosphates/metabolism
12.
Brief Bioinform ; 21(5): 1549-1567, 2020 09 25.
Article in English | MEDLINE | ID: mdl-31626279

ABSTRACT

Antibodies are proteins that recognize the molecular surfaces of potentially noxious molecules to mount an adaptive immune response or, in the case of autoimmune diseases, molecules that are part of healthy cells and tissues. Due to their binding versatility, antibodies are currently the largest class of biotherapeutics, with five monoclonal antibodies ranked in the top 10 blockbuster drugs. Computational advances in protein modelling and design can have a tangible impact on antibody-based therapeutic development. Antibody-specific computational protocols currently benefit from an increasing volume of data provided by next generation sequencing and application to related drug modalities based on traditional antibodies, such as nanobodies. Here we present a structured overview of available databases, methods and emerging trends in computational antibody analysis and contextualize them towards the engineering of candidate antibody therapeutics.


Subject(s)
Antibodies, Monoclonal/chemistry , Antibodies, Monoclonal/immunology , Antibodies, Monoclonal/therapeutic use , Computational Biology/methods , Databases, Protein , Molecular Docking Simulation , Protein Conformation
13.
J Comput Biol ; 27(8): 1219-1231, 2020 08.
Article in English | MEDLINE | ID: mdl-31874057

ABSTRACT

In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset. We propose a set of methods for critiquing deep learning models and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the Sufficient Input Subsets (SIS) technique, which we use to identify subsets of features in each protein sequence that are alone sufficient for classification. Our suite of tools analyzes these subsets to shed light on the decision-making criteria employed by models trained on this task. These tools show that while deep models may perform classification for biologically relevant reasons, their behavior varies considerably across the choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential.


Subject(s)
Computational Biology , Models, Biological , Multigene Family/genetics , Proteins/classification , Deep Learning , Humans , Machine Learning , Neural Networks, Computer , Proteins/genetics
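
The abstract above builds on the Sufficient Input Subsets (SIS) technique, which looks for a small set of sequence positions that alone keeps the model's confidence above a threshold. The sketch below illustrates the backward-selection idea in a simplified form: mask positions one at a time, least important first, then add them back in reverse order until the score clears the threshold. The `score` callable stands in for a trained classifier's confidence, and the mask character `X` and the toy scoring rule are assumptions for illustration; the paper's own extensions of SIS are not reproduced here.

```python
def sufficient_input_subset(sequence, score, threshold, mask="X"):
    """Return sorted positions of one sufficient subset, or None.

    `score` maps a (possibly masked) sequence string to a number. None is
    returned when even the unmasked sequence scores below `threshold`.
    """
    if score(sequence) < threshold:
        return None

    # Backward selection: repeatedly mask the position whose removal
    # leaves the score highest, recording the order of removal.
    current = list(sequence)
    remaining = list(range(len(sequence)))
    removal_order = []
    while remaining:
        best_pos, best_score = None, None
        for pos in remaining:
            saved = current[pos]
            current[pos] = mask
            candidate = score("".join(current))
            current[pos] = saved
            if best_score is None or candidate > best_score:
                best_pos, best_score = pos, candidate
        current[best_pos] = mask
        remaining.remove(best_pos)
        removal_order.append(best_pos)

    # Positions removed last matter most: restore them first until the
    # score on the otherwise fully masked input reaches the threshold.
    masked = [mask] * len(sequence)
    kept = []
    for pos in reversed(removal_order):
        masked[pos] = sequence[pos]
        kept.append(pos)
        if score("".join(masked)) >= threshold:
            break
    return sorted(kept)


def toy_score(seq):
    """Hypothetical stand-in for a classifier: rewards G at index 2 and P at index 5."""
    return 0.5 * (seq[2] == "G") + 0.5 * (seq[5] == "P")


if __name__ == "__main__":
    print(sufficient_input_subset("AAGAAPAA", toy_score, threshold=0.9))
```

With a real model, each call to `score` is a forward pass, so the quadratic number of evaluations in the backward pass is the main cost of the procedure.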
14.
Phys Rev Lett ; 123(23): 238102, 2019 Dec 06.
Article in English | MEDLINE | ID: mdl-31868483

ABSTRACT

Collagen consists of three peptides twisted together through a periodic array of hydrogen bonds. Here we use this as inspiration to find design rules for programmed specific interactions for self-assembling synthetic collagenlike triple helices, starting from disordered configurations. The assembly generically nucleates defects in the triple helix, the characteristics of which can be manipulated by spatially varying the enthalpy of helix formation. Defect formation slows assembly, evoking kinetic pathologies that have been observed to result from mutations in the primary collagen amino acid sequence. The controlled formation and interaction between defects gives a route for hierarchical self-assembly of bundles of twisted filaments.


Subject(s)
Collagen/chemistry , Models, Chemical , Amino Acid Sequence , Models, Molecular , Nanostructures/chemistry , Peptides/chemistry , Protein Conformation, alpha-Helical
15.
Biophys J ; 117(3): 520-532, 2019 08 06.
Article in English | MEDLINE | ID: mdl-31353036

ABSTRACT

The accurate prediction of RNA secondary structure from primary sequence has had enormous impact on research over the past 40 years. Although many algorithms are available to make these predictions, the inclusion of non-nested loops, termed pseudoknots, still poses challenges arising from two main factors: 1) no physical model exists to estimate the loop entropies of complex intramolecular pseudoknots, and 2) their NP-complete enumeration has impeded their study. Here, we address both challenges. First, we develop a polymer physics model that can address arbitrarily complex pseudoknots using only two parameters corresponding to concrete physical quantities, over an order of magnitude fewer than the sparsest state-of-the-art phenomenological methods. Second, by coupling this model to exhaustive enumeration of the set of possible structures, we compute the entire free energy landscape of secondary structures resulting from a primary RNA sequence. We demonstrate that for RNA structures of ∼80 nucleotides, with minimal heuristics, the complete enumeration of possible secondary structures can be accomplished quickly despite the NP-complete nature of the problem. We further show that despite our loop entropy model's parametric sparsity, it performs better than or on par with previously published methods in predicting both pseudoknotted and non-pseudoknotted structures on a benchmark data set of RNA structures of ≤80 nucleotides. We suggest ways in which the accuracy of the model can be further improved.


Subject(s)
Entropy , Nucleic Acid Conformation , Polymers/chemistry , RNA , Algorithms , RNA/chemistry , Thermodynamics
16.
Proc Natl Acad Sci U S A ; 116(24): 11624-11629, 2019 06 11.
Article in English | MEDLINE | ID: mdl-31127041

ABSTRACT

Deep neural networks have achieved state-of-the-art accuracy at classifying molecules with respect to whether they bind to specific protein targets. A key breakthrough would occur if these models could reveal the fragment pharmacophores that are causally involved in binding. Extracting chemical details of binding from the networks could enable scientific discoveries about the mechanisms of drug actions. However, doing so requires shining light into the black box that is the trained neural network model, a task that has proved difficult across many domains. Here we show how the binding mechanism learned by deep neural network models can be interrogated, using a recently described attribution method. We first work with carefully constructed synthetic datasets, in which the molecular features responsible for "binding" are fully known. We find that networks that achieve perfect accuracy on held-out test datasets still learn spurious correlations, and we are able to exploit this nonrobustness to construct adversarial examples that fool the model. This makes these models unreliable for accurately revealing information about the mechanisms of protein-ligand binding. In light of our findings, we prescribe a test that checks whether a hypothesized mechanism can be learned. If the test fails, it indicates that the model must be simplified or regularized and/or that the training dataset requires augmentation.


Subject(s)
Protein Binding/physiology , Proteins/chemistry , Algorithms , Ligands , Machine Learning , Models, Chemical , Neural Networks, Computer
17.
Sci Rep ; 8(1): 13809, 2018 09 14.
Article in English | MEDLINE | ID: mdl-30218106

ABSTRACT

Fibrillar collagens have mechanical and biological roles, providing tissues with both tensile strength and cell binding sites which allow molecular interactions with cell-surface receptors such as integrins. A key question is: how do collagens allow tissue flexibility whilst maintaining well-defined ligand binding sites? Here we show that proline residues in collagen glycine-proline-hydroxyproline (Gly-Pro-Hyp) triplets provide local conformational flexibility, which in turn confers well-defined, low energy molecular compression-extension and bending, by employing two-dimensional 13C-13C correlation NMR spectroscopy on 13C-labelled intact ex vivo bone and in vitro osteoblast extracellular matrix. We also find that the positions of Gly-Pro-Hyp triplets are highly conserved between animal species, and are spatially clustered in the currently-accepted model of molecular ordering in collagen type I fibrils. We propose that the Gly-Pro-Hyp triplets in fibrillar collagens provide fibril "expansion joints" to maintain molecular ordering within the fibril, thereby preserving the structural integrity of ligand binding sites.


Subject(s)
Collagen/chemistry , Collagen/metabolism , Proline/metabolism , Amino Acid Sequence , Amino Acids/metabolism , Animals , Female , Fibrillar Collagens/metabolism , Fibrillar Collagens/physiology , Glycine/chemistry , Hydroxyproline/chemistry , Magnetic Resonance Spectroscopy , Mice , Mice, Inbred C57BL , Osteoblasts/metabolism , Peptides/chemistry , Proline/physiology , Protein Conformation , Sheep
18.
Protein Eng Des Sel ; 31(7-8): 267-275, 2018 07 01.
Article in English | MEDLINE | ID: mdl-30053276

ABSTRACT

Nanobodies (Nbs) are a class of antigen-binding protein derived from camelid immune systems, which achieve equivalent binding affinities and specificities to classical antibodies (Abs) despite being comprised of only a single variable domain. Here, we use a data set of 156 unique Nb:antigen complex structures to characterize Nb-antigen binding and draw comparison to a set of 156 unique Ab:antigen structures. We analyse residue composition and interactions at the antigen interface, together with structural features of the paratopes of both data sets. Our analysis finds that the set of Nb structures displays much greater paratope diversity, in terms of the structural segments involved in the paratope, the residues used at these positions to contact the antigen and furthermore the type of contacts made with the antigen. Our findings suggest a different relationship between contact propensity and sequence variability from that observed for Ab VH domains. The distinction between sequence positions that control interaction specificity and those that form the domain scaffold is much less clear-cut for Nbs, and furthermore H3 loop positions play a much more dominant role in determining interaction specificity.


Subject(s)
Antigens/immunology , Single-Chain Antibodies/immunology , Amino Acid Sequence , Animals , Antibody Specificity , Crystallography, X-Ray , Models, Molecular , Protein Conformation , Single-Chain Antibodies/chemistry
19.
Proteins ; 86(7): 697-706, 2018 07.
Article in English | MEDLINE | ID: mdl-29569425

ABSTRACT

Nanobodies are a class of antigen-binding protein derived from camelids that achieve comparable binding affinities and specificities to classical antibodies, despite comprising only a single 15 kDa variable domain. Their reduced size makes them an exciting target molecule with which we can explore the molecular code that underpins binding specificity: how is such high specificity achieved? Here, we use a novel dataset of 90 nonredundant, protein-binding nanobodies with antigen-bound crystal structures to address this question. To provide a baseline for comparison we construct an analogous set of classical antibodies, allowing us to probe how nanobodies achieve high specificity binding with a dramatically reduced sequence space. Our analysis reveals that nanobodies do not diversify their framework region to compensate for the loss of the VL domain. In addition to the previously reported increase in H3 loop length, we find that nanobodies create diversity by drawing their paratope regions from a significantly larger set of aligned sequence positions, and by exhibiting greater structural variation in their H1 and H2 loops.


Subject(s)
Antibodies , Single-Domain Antibodies , Antibodies/chemistry , Antibodies/genetics , Antibodies/immunology , Antibody Specificity , Binding Sites, Antibody , Models, Molecular , Protein Conformation , Sequence Alignment , Single-Domain Antibodies/chemistry , Single-Domain Antibodies/genetics , Structure-Activity Relationship
20.
Curr Opin Struct Biol ; 49: 123-128, 2018 04.
Article in English | MEDLINE | ID: mdl-29452923

ABSTRACT

Data driven computational approaches to predicting protein-ligand binding are currently achieving unprecedented levels of accuracy on held-out test datasets. Up until now, however, this has not led to corresponding breakthroughs in our ability to design novel ligands for protein targets of interest. This review summarizes the current state of the art in this field, emphasizing the recent development of deep neural networks for predicting protein-ligand binding. We explain the major technical challenges that have caused difficulty with predicting novel ligands, including the problems of sampling noise and the challenge of using benchmark datasets that are sufficiently unbiased that they allow the model to extrapolate to new regimes.


Subject(s)
Computational Biology , Ligands , Machine Learning , Models, Statistical , Proteins/chemistry , Quantitative Structure-Activity Relationship , Algorithms , Computational Biology/methods , Humans , Protein Binding